Introduction via simple version

n_organism <- 10
baseline_indices <- 2:n_organism
exponential_indices <- 1
n_bp <- 100
k <- 40
n_kmer <- n_bp - k
time_window <- 14
baseline <- 100
exp_initial <- 1
r <- 0.4
read_depth <- 10^4

Suppose that there are 10 organisms in total in the sample, and nothing else. Each organism has a genome that is exactly 100 base-pairs long. Each genome is broken into \(k\)-mers of length 40. There are 60 \(k\)-mers from each organism, obtained by subtracting the \(k\)-mer length from the genome size. Assume that each \(k\)-mer is unique, and there are no sequencing errors. We collect data over a period of 14 days \(t = 0, 1, \dots\) at a single environmental monitoring site.

Organism-level baseline and exponential regimes

We consider two regimes 1) baseline and 2) exponential growth.

In the baseline regime, a species has a concentration \(c^{(b)}\) in water of 100 copy per \(\mu\)L1, independent of the day. \[ c^{(b)}_t = c^{(b)} \] In the exponential growth regime, a species starts at a concentration \(c^{(e)}_0\), and over the 14 day window its concentration increases exponentially according to \[ c^{(e)}_t = c^{(e)}_0 \exp(rt) \] where \(r\) is the growth rate which we take to be 0.4, and \(c^{(e)}_0\) is the initial exponential concentration which we take to be 1 – lower than the baseline concentration as we assume the exponentially increasing organism to be novel.

The true concentration of the baseline organisms over the time window is:

conc_baseline <- rep(baseline, time_window)
conc_baseline
##  [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100

And the true concentration of the exponentially increasing organism over the time window is:

conc_exponential <- exp_initial * exp(r * 0:13)
conc_exponential
##  [1]   1.000000   1.491825   2.225541   3.320117   4.953032   7.389056  11.023176  16.444647  24.532530  36.598234  54.598150
## [12]  81.450869 121.510418 181.272242

Suppose that organisms 2, 3, 4, 5, 6, 7, 8, 9, 10 follow the baseline behaviour and organism 1 follows the exponential behaviour. We will represent the true concentrations of each organism with a matrix C:

C <- matrix(
  data = NA,
  nrow = time_window,
  ncol = n_organism
)

C[, exponential_indices] <- conc_exponential
C[, baseline_indices] <- conc_baseline
C
##             [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   1.000000  100  100  100  100  100  100  100  100   100
##  [2,]   1.491825  100  100  100  100  100  100  100  100   100
##  [3,]   2.225541  100  100  100  100  100  100  100  100   100
##  [4,]   3.320117  100  100  100  100  100  100  100  100   100
##  [5,]   4.953032  100  100  100  100  100  100  100  100   100
##  [6,]   7.389056  100  100  100  100  100  100  100  100   100
##  [7,]  11.023176  100  100  100  100  100  100  100  100   100
##  [8,]  16.444647  100  100  100  100  100  100  100  100   100
##  [9,]  24.532530  100  100  100  100  100  100  100  100   100
## [10,]  36.598234  100  100  100  100  100  100  100  100   100
## [11,]  54.598150  100  100  100  100  100  100  100  100   100
## [12,]  81.450869  100  100  100  100  100  100  100  100   100
## [13,] 121.510418  100  100  100  100  100  100  100  100   100
## [14,] 181.272242  100  100  100  100  100  100  100  100   100

Sampling \(k\)-mers

We assume that the true concentration of each \(k\)-mer is that of the corresponding organism. This may be represented by copying each column of the matrix C 100 times.

K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)

The matrix K now has 14 rows, one for each day, and 600 columns, one for each \(k\)-mer (of which there are 60 times 10). We can calculate proportions, where the total number of \(k\)-mers is given by the row sums:

K_norm <- t(apply(K, 1, function(x) x / sum(x), simplify = TRUE))
# useful::topleft(K_norm)
# useful::bottomleft(K_norm)

One way to represent the sequencing process is as a sample from the collection of \(k\)-mers. For example, we could consider a multinomial sample with probabilities of sampling each \(k\)-mer given by K_norm and sample size given by the read depth 10000.

To demonstrate this, suppose we simulate the sequencing process on day 1. The proportions of each \(k\)-mer are given by K_norm[1, ], and we may sample using rmultinom. A histogram of the \(k\)-mer counts at day 1 under each regime, showing that the exponential regime is initialised at low count numbers, is given by:

sample_one <- rmultinom(1, read_depth, K_norm[1, ])

get_regime <- function(organism_index) {
  if(organism_index %in% baseline_indices) return("baseline")
  if(organism_index %in% exponential_indices) return("exponential")
  else return(NA)
}

#' Testing that this function works as intended
stopifnot(purrr::map_chr(1:2, get_regime) == c("exponential", "baseline"))

data.frame(count = sample_one) %>%
  mutate(
    organism = rep(1:n_organism, each = n_kmer),
    regime = str_to_title(purrr::map_chr(organism, get_regime)),
  ) %>%
  ggplot(aes(x = count, group = regime, fill = regime)) +
    geom_histogram(alpha = 0.8) +
    labs(x = "k-mer count", y = "Occurances", fill = "Regime", title = "k-mer counts at day 1") +
    scale_fill_manual(values = cbpalette) +
    theme_minimal()

Now, we will take multinomial samples from all of the days with a call to apply:

sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))

colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))

sample_df <- sample %>%
  as.data.frame() %>%
  tibble::rownames_to_column("id") %>%
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    values_to = "count",
    names_prefix = "day"
  ) %>%
  mutate(
    id = as.numeric(id),
    day = as.numeric(day),
    kmer = rep(rep(1:n_kmer, each = time_window), times = n_organism),
    organism = rep(1:n_organism, each = n_kmer * time_window),
    regime = str_to_title(purrr::map_chr(organism, get_regime))
  )

saveRDS(sample_df, "sample0.rds")

The data frame sample_df filtered to the first \(k\)-mer from the first organism is given by:

Let’s plot the data from organism 1 which we have set to be exponentially increasing, together with the data from all other organisms which we have set to follow a baseline distribution:

reads_plot <- function(df) {
  
  sample_summary <- df %>%
    group_by(day, regime) %>%
    summarise(
      count_upper = quantile(count, 0.95),
      count_median = median(count),
      count_lower = quantile(count, 0.05)
    )

  ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper, group = regime)) +
    geom_ribbon(alpha = 0.1, aes(fill = regime)) +
    geom_line(aes(col = regime), size = 1.5) +
    geom_line(data = df, aes(x = day, y = count, col = regime, group = id),
               alpha = 0.05, inherit.aes = FALSE) + 
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    scale_fill_manual(values = cbpalette) +
    guides(fill = "none") +
    labs(x = "Day", y = "Number of reads in sample", col = "Regime")
}

reads_plot(sample_df)

sample_summary <- sample_df %>%
  group_by(day, regime) %>%
  summarise(
    count_upper = quantile(count, 0.95),
    count_median = median(count),
    count_lower = quantile(count, 0.05)
  )
pickout_day <- 10

With these settings, on day 10 the median count of organism 1 is 18, 6 and \(k\)-mers from organism 1 represent 10.8, 3.6% of the total reads.

Now, we will start to make this simulation more realistic by adding sources of noise.

Adding noise to the true abundances

In reality, the organism abundances will vary, neither being fixed to a constant or increasing deterministically. We will now suppose that the abundances vary stochastically as follows.

Lognormal geometric random walk

For the baseline concentration, one possible model is a lognormal geometric random walk such that \[ c^{(b)}_t = c^{(b)}_{t - 1} \times \exp \left( \epsilon^{(b)}_t \right) \] where \(\epsilon^{(b)}_t \sim \mathcal{N}(0, \sigma_b^2)\). Another way to calculate the baseline concentration at time \(t\), \(c^{(b)}_t\), is as \(c^{(b)}_t = c^{(b)}_{0} \times \exp \left( x^{(b)}_t \right)\) where \(x^{(b)}_t = \sum_{s = 1}^t \epsilon^{(b)}_s\) follows a random walk. The expected value and variance of this random walk are \[ \mathbb{E}[x^{(b)}_t] = \mathbb{E} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{E} \left[ \epsilon^{(b)}_s \right] = 0, \\ \mathbb{Var} [ x^{(b)}_t ] = \mathbb{Var} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{Var} \left[ \epsilon^{(b)}_s \right] = t\sigma_b^2 \] We can use the formula for the moment generating function of a Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(m_X(u) = \mathbb{E} \left[ e^{uX} \right] = \exp(\mu u + \sigma^2u^2 / 2)\), to calculate that the expected concentration at time \(t\) under this model is \[ \mathbb{E} \left[ c^{(b)}_t \right] = c^{(b)}_{0} \exp(\sigma_b^2/2) > c^{(b)}_{0}. \]

For the exponential regime, we could also suppose a geometric lognormal random walk \[ c^{(e)}_t = c^{(e)}_{t - 1} \times \exp \left( \epsilon^{(e)}_t \right) \] where \(\epsilon^{(e)}_t \sim \mathcal{N}(r, \sigma_e^2)\) and the growth rate \(r > 0\), and the expected concentration at time \(t\) under this model is \(\mathbb{E} \left[ c^{(e)}_t \right] = c^{(e)}_{0} \exp(rt + \sigma_e^2/2)\).

n_baseline_paths <- 10

To show how this behaviour looks, let’s simulate 10 samples from both the baseline and exponential regimes:

baseline_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths), 
  epsilon = rnorm(time_window * n_baseline_paths),
  id = as.factor(rep(1:n_baseline_paths, each = time_window)),
  regime = "Baseline"
  ) %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(
    x = cumsum(epsilon),
    conc = baseline * exp(x)
  ) %>%
  ungroup()

exponential_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths), 
  epsilon = rnorm(time_window * n_baseline_paths, mean = r),
  id = as.factor(rep(1:n_baseline_paths, each = time_window)),
  regime = "Exponential"
  ) %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(
    x = cumsum(epsilon),
    conc = baseline * exp(x)
  ) %>%
  ungroup()

df <- bind_rows(exponential_df, baseline_df)

epsilon_x_conc_plot <- function(df, title) {
  df %>%
    pivot_longer(
      cols = c("epsilon", "x", "conc"),
      names_to = "variable",
      values_to = "value"
    ) %>%
    mutate(
      variable = recode_factor(variable, 
        "epislon" = "epsilon", 
        "x" = "x",
        "conc" = "Concentration")
    ) %>%
    ggplot(aes(x = day, y = value, group = id, col = regime)) +
      geom_line(alpha = 0.5) +
      facet_wrap(variable ~ regime, scales = "free", ncol = 2) +
      scale_colour_manual(values = cbpalette) +
      theme_minimal() +
      guides(col = "none") +
      labs(x = "Day", y = "", title = title)
}

epsilon_x_conc_plot(df, "Lognormal geometric random walk")

The key takeaway here about the lognormal geometric random walk model is that even though the noise is IID and Gaussian, when you take the cumulative sum and exponentiate it can lead to large deviations in concentration, which follows from the variance of a random walk increasing linearly in time. The maximum concentration value under the baseline regime observed was 357173.8768472 – which is 3571.7387685 greater than the baseline concentration.

Lognormal geometric auto-regressive process for baseline

rho <- 0.75

To remedy the issue observed above, we could use a auto-regressive process for \(x^{(b)}_t\) rather than a random walk. Such a model is specified recursively by \[ x^{(b)}_1 \sim \left( 0, \frac{1}{1 - \rho^2} \right), \\ x^{(b)}_t = \rho x^{(b)}_{t - 1} + \epsilon^{(b)}_t, \quad t = 2, \ldots, T, \] where the lag-one correlation parameter \(\rho \in (0, 1)\) determines the extent of autocorrelation, which we take to be 0.75, and \(\epsilon^{(b)}_t \sim \mathcal{N}(0, 1)\) as before.

baseline_ar_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths), 
  epsilon = rnorm(time_window * n_baseline_paths),
  id = as.factor(rep(1:n_baseline_paths, each = time_window)),
  regime = "Baseline"
  ) %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(
    x = stats::arima.sim(model = list(order = c(1, 0, 0), ar = rho), n = time_window, innov = epsilon),
    conc = baseline * exp(x)
  ) %>%
  ungroup()

df_ar <- bind_rows(exponential_df, baseline_ar_df)

epsilon_x_conc_plot(df_ar, "Lognormal geometric auto-regressive behaviour for the baseline")

We will now simulate and plot data for this model (lognormal geometric auto-regressive for the baseline, and lognormal geometric random walk for the exponential) with 10 organisms in total, each of which with 60 \(k\)-mers as before.

C <- matrix(
  data = NA,
  nrow = time_window,
  ncol = n_organism
)

C[, baseline_indices] <- df_ar %>%
  filter(
    regime == "Baseline",
    id %in% baseline_indices
  ) %>%
  select(day, id, conc) %>%
  pivot_wider(
    names_from = id,
    values_from = conc
  ) %>%
  select(-day) %>%
  as.matrix()

C[, exponential_indices] <- df_ar %>%
  filter(
    regime == "Exponential",
    id %in% exponential_indices
  ) %>%
  select(day, id, conc) %>%
  pivot_wider(
    names_from = id,
    values_from = conc
  ) %>%
  select(-day) %>%
  as.matrix()

C
##              [,1]      [,2]       [,3]       [,4]       [,5]        [,6]      [,7]       [,8]       [,9]      [,10]
##  [1,]    346.7262 106.53979   3.002622  20.326188  387.59663   0.5089494 37.191355  303.54078   36.33516 105.210357
##  [2,]    752.2905  36.36845   1.389456  43.734120  425.87768   2.0659825 33.277335  493.93055   70.71914 368.871569
##  [3,]   1541.0798  59.02423   3.671910  56.207372  107.12055   0.3609627 15.163185  373.79549   83.75784  38.249862
##  [4,]   2365.0461  61.87155  28.974830  87.287097   52.13955   3.5397057  4.802785 2240.08113  310.56468  48.830268
##  [5,]   3257.3992 671.69638  21.782217 155.221453 1218.43338   6.9867826 17.040303  816.17778  424.65590  19.422040
##  [6,]    914.7788  75.24056  17.092531  95.180084  297.37299  13.0198224  4.443300  291.72722  152.96636 112.391619
##  [7,]   4941.1849 360.23714  70.809749 289.689850   64.27330 146.2311566  8.628433  221.30107   33.40432  45.622797
##  [8,]  10006.2113 483.60075 123.717650 198.539368  130.58316 167.2179529 16.111401   81.48438   67.10901  33.611010
##  [9,]  18706.0098 391.61986  26.440082 128.070231  218.59745 662.0727995 47.821772  115.38227  180.99931  23.980864
## [10,]  31717.4725 158.00190  67.434347  78.466536  105.73715 376.9399587 21.671678   76.71931  150.47411   6.134126
## [11,]   4944.2423  36.94716  67.399611  45.027965  106.81603 211.0465284 14.168473  165.63571 1631.73090  10.044266
## [12,]  37434.8568  22.51381 905.375050 117.671849  212.82448 115.7327025  5.498178   84.71967 3016.58823  14.623175
## [13,] 274715.5658 126.77212 177.273310  35.912714  247.12146  58.9254270  6.735628  568.35010  898.16254  20.941630
## [14,] 331911.6934  12.03060  16.226346   9.866958  443.46443  35.0140262 54.618336  384.59222 2036.16913  45.443767
#' Bring together code from above
sample_kmers <- function(C, n_kmer, read_depth) {
  time_window <- nrow(C)
  n_organism <- ncol(C)
  K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
  K_norm <- t(apply(K, 1, function(x) x / sum(x), simplify = TRUE))
  sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))
  colnames(sample) <- paste0("day", 1:ncol(sample))
  rownames(sample) <- paste0(1:nrow(sample))
  
  sample %>%
    as.data.frame() %>%
    tibble::rownames_to_column("id") %>%
    pivot_longer(
      cols = starts_with("day"),
      names_to = "day",
      values_to = "count",
      names_prefix = "day"
    ) %>%
    mutate(
      id = as.numeric(id),
      day = as.numeric(day),
      kmer = rep(rep(1:n_kmer, each = time_window), times = n_organism),
      organism = rep(1:n_organism, each = n_kmer * time_window),
      regime = str_to_title(purrr::map_chr(organism, get_regime))
    )
}

sample_df <- sample_kmers(C, n_kmer, read_depth)

saveRDS(sample_df, "sample1.rds")

reads_plot(sample_df)

Sequencing observation models

See Townes (2020) for a nice review of models for count data.

Dirichlet-multinomial model

Rather than assuming that during sequencing the probabilities of sampling each \(k\)-mer are given by the true proportions of the \(k\)-mer, we might want to add additional noise. One way to do this is to sample the proportions \(w\) from a probability distribution over the simplex such as the Dirichlet. Given a vector of positive real numbers \(\alpha = (\alpha_1, \ldots, \alpha_K)\) then \(w \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)\) if \[ p(w; \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} w_1^{\alpha_1 - 1} w_2^{\alpha_2 - 1} \times \cdots \times w_K^{\alpha_K - 1} \] The expected value of \(w_i\) is \(\alpha_i / \sum_{k = 1}^K \alpha_k\).

dirichlet_draw <- gtools::rdirichlet(n = 1, alpha = K_norm[1, ])

Methods for improving simulation accuracy

Time status \(2 \times 25\text{ min}\)

How could we go about making the simulations more realistic?

The realism of a given model can be divided into structural realism and parametric realism2. Structural realism refers to the distributional assumptions made. Parametric realism refers to the specific (hyper-)parameter values chosen for the assumed distributions.

Realism can be assessed by (1) simulating data from the generative model we propose, (2) plotting and summarising the simulated data, possibly using dimension reduction techniques (3) comparing to real data, or our intuitions and expectations of it. The final stage requires developing a good understanding of the real data, or even better closely collaborating with domain scientists who have this understanding. Aside from allowing us to create more accurate simulations, developing a good understanding of the data we hope to model is a worthy goal in itself. I’m imagining an iterative process where we simulate data, ask scientist if it looks good, write down what they say, try to improve it, etc.

One method for improving parametric realism is to obtain real data, then fit a model to the real data to obtain posteriors over parameters of interest. We could then use posterior predictive samples drawn from the model as our simulations. In this setting, we may want to widen the posterior somewhat, as the sample of real data we obtain is unlikely to be representative of all real data we might encounter.

Structural realism may be improved by noticing stylistic features of the real data which the simulated data does not possess. These features might include shapes of distributions, tail heaviness, skewness, sources and structure of variation, sources and structure of biases, presence of outliers or multi-modalities etc.

For further reading on this, I would suggest the preprint (and upcoming book) Bayesian workflow by Gelman et al.

Bibliography

Townes, F William. 2020. “Review of Probability Distributions for Modeling Count Data.” arXiv Preprint arXiv:2001.04343.

  1. The units don’t matter much.↩︎

  2. This distinction is somewhat arbitrary, in that there may be no clear distinction between structural and parameteric realism, but may still be useful. For example some distributions are general enough to admit other distributions as special cases when certain parameter versions are chosen e.g. the Gaussian might (perhaps unhelpfully) be thought of as a special case of the Student-t as the degrees of freedom tends to infinity.↩︎